Project proposal

In the financial fund industry, does green investment generate better performance?

1) The problem

2) The data

(a) Clear overview of your data

2.a.1 Data origin and gathering

2.a.2 Feature overview

A description of each feature is available in the "Key" Excel sheet provided by FFF:

2.a.3 Feature distribution

2.a.4 Data completeness

2.a.4.1 Remove non-relevant columns

2.a.5 Target selection

The "Financial performance: Month end trailing returns, year 1" variable is the most complete. A high percentage of funds were launched more than 2 years ago, so their "year 1" performance is available, but less than 10 years ago, so they have no "year 10" performance data.
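The completeness argument above can be checked by measuring the share of non-null values in each candidate target column. The sketch below is only an illustration: the DataFrame and the column names `return_y1`, `return_y3` and `return_y10` are hypothetical stand-ins for the trailing-return columns, and the values are toy data, not project data.

```python
import pandas as pd

# Toy data: hypothetical stand-ins for the trailing-return columns.
funds = pd.DataFrame({
    "return_y1": [0.05, 0.02, -0.01, 0.04],
    "return_y3": [0.12, None, 0.03, None],
    "return_y10": [None, None, 0.30, None],
})

# Share of non-null values per candidate target, most complete first.
completeness = funds.notna().mean().sort_values(ascending=False)
print(completeness)

# The most complete column is the natural target candidate.
best_target = completeness.idxmax()
```

On the real data, the same two lines would be applied to the actual performance columns to confirm which one is the most complete.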

2.a.6 Data uniqueness

2.a.7 Columns renaming

2.a.8 0-Values

2.a.9 Null values

2.a.10 Duplicate values check

After looking at the raw data, it appears that the data was published twice, in 2 datasets:
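One way to confirm such an overlap is to stack both datasets and count the rows that appear more than once. This is a minimal sketch on toy data: the frames `dataset_a` and `dataset_b` and the columns `date` and `return_y1` are hypothetical, while `shareclassname` is the fund identifier used in this proposal.

```python
import pandas as pd

# Toy frames standing in for the two published datasets (hypothetical).
dataset_a = pd.DataFrame({
    "shareclassname": ["Fund A", "Fund B"],
    "date": ["2020-01-31", "2020-01-31"],
    "return_y1": [0.05, 0.02],
})
dataset_b = pd.DataFrame({
    "shareclassname": ["Fund B", "Fund C"],
    "date": ["2020-01-31", "2020-01-31"],
    "return_y1": [0.02, 0.07],
})

combined = pd.concat([dataset_a, dataset_b], ignore_index=True)

# Rows identical across all columns are flagged from their second occurrence on.
n_duplicates = int(combined.duplicated().sum())

# Keep a single copy of each row.
deduplicated = combined.drop_duplicates().reset_index(drop=True)
print(n_duplicates, len(deduplicated))
```

If duplicates should be identified on a key rather than on the full row, both `duplicated` and `drop_duplicates` accept a `subset` argument listing the key columns.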

2.a.11 Missing data

With the plot below, we check whether there are gaps in the data:

The plot shows which funds (shareclassname) have only partial information.
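A per-fund summary of missing values could be computed as follows. This is a sketch on toy data: only `shareclassname` comes from the text above, and the feature columns `return_y1` and `esg_score` are hypothetical.

```python
import pandas as pd

# Toy data (hypothetical feature columns and values).
funds = pd.DataFrame({
    "shareclassname": ["Fund A", "Fund A", "Fund B", "Fund B"],
    "return_y1": [0.05, None, 0.02, 0.03],
    "esg_score": [None, None, 55.0, 60.0],
})

# Fraction of missing cells per fund, averaged over all feature columns.
features = funds.drop(columns="shareclassname")
missing_share = (
    features.isna()
    .groupby(funds["shareclassname"])
    .mean()        # missing fraction per fund and per column
    .mean(axis=1)  # averaged across columns
)

# Funds with at least one missing value have partial information.
partial_funds = missing_share[missing_share > 0].index.tolist()
print(missing_share)
```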

2.a.11.1 Missing date

2.a.12 Categorical analysis

2.b Process the data

2.b.1 Histogram for numeric data

2.b.2 Continuous features encoding exploration

2.b.3 Best encoding assessment

2.b.4 Pre-processing numerical features

2.b.5 Pre-processing categorical features

2.b.6 Remove highly correlated features
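A common way to carry out this step is to drop one feature from every pair whose absolute correlation exceeds a threshold. The sketch below assumes Pearson correlation and a threshold of 0.9. Both are illustrative assumptions rather than values taken from this project, and the helper `drop_highly_correlated` and the column names are hypothetical.

```python
import numpy as np
import pandas as pd

def drop_highly_correlated(df: pd.DataFrame, threshold: float = 0.9) -> pd.DataFrame:
    """Drop one feature from each pair with |Pearson correlation| > threshold."""
    corr = df.corr().abs()
    # Keep only the upper triangle so each pair is inspected once.
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)

# Toy data: "b" is an exact multiple of "a", "c" is only weakly related.
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0],
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],
})
reduced = drop_highly_correlated(toy)
print(list(reduced.columns))
```

With this approach the later column of each correlated pair is the one dropped, so the column order decides which feature is kept.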

2.b Plan to manage and process the data

Data cleaning and data manipulation

Feature engineering

3) Exploratory data analysis (EDA)

(a) Preliminary EDA

3.a.1 Correlation of target with features

Additional information:

3.a.2 Remove trend

(b) Discuss how the EDA informs your project plan

(c) What further EDA do you plan for the project?

4) Machine learning

(a) Phrase your project goal as a clear machine learning question

(b) What models are you planning to use and why?

(c) Please tell us your detailed machine learning strategy

5) Model fitting

5.1 Features selection

5.2 Get the top feature names

5.3 ML training regressors

5.3.1 Random forests

5.3.2 kNNs

5.3.3 Neural networks

We have defined here a neural network with the following properties:

5.4 Compare ML with target and conclusion

5.4.1 MAE final comparison
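MAE (mean absolute error) is the average of the absolute differences between predicted and true values. The sketch below compares several sets of predictions against a baseline that always predicts the mean. The names `baseline_mean`, `model_a` and `model_b` are hypothetical, and every number is invented for illustration and is not a project result.

```python
import numpy as np

def mean_absolute_error(y_true, y_pred) -> float:
    """Average absolute difference between true and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean(np.abs(y_true - y_pred)))

# Toy targets and predictions (invented for illustration only).
y_true = np.array([1.0, 2.0, 3.0, 4.0])
predictions = {
    "baseline_mean": np.full(4, y_true.mean()),  # always predicts the mean
    "model_a": np.array([1.5, 2.0, 2.5, 4.0]),
    "model_b": np.array([0.0, 2.0, 3.0, 6.0]),
}

scores = {name: mean_absolute_error(y_true, pred) for name, pred in predictions.items()}

# Print the models from best (lowest MAE) to worst.
for name, mae in sorted(scores.items(), key=lambda item: item[1]):
    print(f"{name}: MAE = {mae:.3f}")
```

Reporting the baseline next to the trained regressors makes it clear how much each model improves on a trivial prediction.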

5.4.2 Project conclusion

5.5 Annex - Hyperparameter optimization

5.5.2 NN hyperparameters

It was found that the best combination of hyperparameters is:

5.6 Extra - Calculate kNN MAE with grade model